dplyr
Now that you’ve learned the basics of R programming, we’ll take things a step further and start working on your skills related to data analysis. You will likely be unfamiliar with some of the operations you need to execute in this exercise. Part of the goal of this exercise, however, is for you to become more familiar with the help commands in R (e.g., using help() to look up function documentation) and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available so that you can become an effective problem solver, working independently on data analyses.
So far, we’ve run code in the Console. This is fine for quick queries, but for anything to be shareable and reproducible, and to reduce our own efforts in the long term, we should run code via a script. An R script is just a simple text file. R-Studio uses the script by copying R commands from the text in the file and pasting them into the Console, as if you were entering the commands manually yourself. This greatly enhances our ability to build off what we have created in the past, learn from previous experience, and quickly re-run analyses when new data are received.
To create an R script, go to File > New File > R Script in R-Studio, then save the file in your Scripts folder. Name this file IntroDataAnalysis.R. R-Studio will open your R script automatically after creating it. Notice that the scripting window appears above the Console in what’s known as the Source pane.
At the top
of your script, provide brief information that describes the content of
your script. The content is up to you, but should briefly identify who
created the file, when the file was created, and what the script does.
This will help when you return to the file at a later date or if you
decide to share the file with a colleague in the future. Remember that
anything after the # symbol is a comment. Use this symbol
to make your code more readable, similar to below.
# ******************************************************************
# ******************************************************************
# Project: Introduction to Data Analysis in R
# Description: A script which details some basic commands on how to manipulate data
# Author: <Your Name>
# Date Initialized: <dd month yyyy>
# ******************************************************************
# ******************************************************************
Running code via an R script is different than running code in the Console. To interpret and run the code you’ve written, R needs you to send the code from the script to the Console. Some common ways to run your code include:
- Place your cursor on the line (or highlight the lines) you want to run and press Ctrl + Enter (Windows) or Command + Return (Mac).
- Click the Run button in the top right of the Source pane with your mouse.
You’ll find that your Environment (Workspace) in the upper right panel will quickly become full with user-defined objects. It’s generally good practice to work with a clean Workspace when starting a session. I generally start all my scripts with the following command to make sure I am starting fresh, something we will do to help develop good programming practices and reduce clutter.
# Clean your workspace/remove all objects
rm(list=ls())
# You can also remove a specific dataset using the following command
#rm(dataset)
dplyr
The most basic R skill is to query and manipulate data tables. As a beginner programmer, it is imperative to familiarize yourself with how to manipulate data. Reinforcing these skills is like expanding your vocabulary in the new language that you are learning and is a great way to improve your R proficiency. If you wish to become really good at R, but don’t know where to start, start with data table manipulation!
The base R functions that come with the default R installation have the capacity for almost all table manipulation needs (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, their syntax is often less user-friendly and intuitive than that of the specialized packages built for table manipulation. So, here we introduce a few of the most useful table manipulation functions within the dplyr package.
Note that you will have to use the install.packages() and library() functions to download and activate dplyr before using it. You only need to install the package once on your computer. You will, however, need to ‘activate’ the package any time you want to use the functions that exist within the package.
#install.packages("dplyr")
library(dplyr)
R has multiple functions for reading in table data. Here we’ll use
the base function read.csv() to import a table named
panda_data.csv that is located in your Data
folder. Text files (.txt) can be imported using the
function read.delim(). See the help files for each function
and search Google for information on other functions to read other data
types.
View the first few rows of the data table using the function
head() or click on the dataframe in the Environment/History
panel.
# Read dataset
panda_data <- read.csv(file="Data/panda_data.csv")
# Look at the data
head(panda_data)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## genetic_value3 year
## 1 60 2017
## 2 90 2017
## 3 80 2017
## 4 65 2017
## 5 76 2017
## 6 87 2017
Questions:
1: How would you display more (or fewer) rows of the data table using the head() function?
The function select() is a powerful tool for selecting columns of interest. You must specify the dataset you want to query and then provide an expression for selecting the columns of interest (select(.data, expression)). A few examples are provided below:
# select column called panda_name
select(panda_data, panda_name)
## panda_name
## 1 da_da
## 2 mao_mao
## 3 lan_lan
## 4 bei_bei
## 5 bao_bao
## 6 tian_tian
## 7 wei_wei
## 8 shuang_shuang
## 9 qiuqiu
## 10 lang_lang
# select all columns in the data except panda_name
select(panda_data, -panda_name)
## ID age weight_kg sex base genetic_value1 genetic_value2 genetic_value3
## 1 4415463 5 100 m CD 99 50 60
## 2 4415522 4 120 f CD 70 30 90
## 3 4416073 7 95 f WL 80 NA 80
## 4 4416405 5 120 f WL 80 NA 65
## 5 4417779 5 110 m WL 75 60 76
## 6 4424490 4 89 m WL 95 70 87
## 7 4424967 7 98 f CD 60 NA 92
## 8 4424657 8 110 m WL 92 NA 50
## 9 4423758 9 130 f CD 81 50 30
## 10 4427758 10 110 m WL 60 NA 70
## year
## 1 2017
## 2 2017
## 3 2017
## 4 2017
## 5 2017
## 6 2017
## 7 2017
## 8 2017
## 9 2018
## 10 2018
# select a range of columns, from age to sex
select(panda_data, age:sex)
## age weight_kg sex
## 1 5 100 m
## 2 4 120 f
## 3 7 95 f
## 4 5 120 f
## 5 5 110 m
## 6 4 89 m
## 7 7 98 f
## 8 8 110 m
## 9 9 130 f
## 10 10 110 m
Various selection helpers also exist, including:
- starts_with(): select columns that start with the same text.
- ends_with(): select columns that end with the same text.
- contains(): select columns that contain the same text.
- matches(): select columns that match a regular expression.
- one_of(): select columns that are from a group of names.
# select all columns that start with "genetic" in their column names
select(panda_data, starts_with("genetic"))
## genetic_value1 genetic_value2 genetic_value3
## 1 99 50 60
## 2 70 30 90
## 3 80 NA 80
## 4 80 NA 65
## 5 75 60 76
## 6 95 70 87
## 7 60 NA 92
## 8 92 NA 50
## 9 81 50 30
## 10 60 NA 70
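The other selection helpers work the same way. Here is a quick sketch of ends_with(), contains(), and matches(); a few toy rows stand in for panda_data so the snippet runs on its own:

```r
library(dplyr)

# A small stand-in for panda_data so this sketch runs on its own
panda_data <- data.frame(
  panda_name = c("da_da", "mao_mao"),
  weight_kg = c(100, 120),
  genetic_value1 = c(99, 70),
  genetic_value2 = c(50, 30)
)

# columns whose names end in "kg"
select(panda_data, ends_with("kg"))

# columns whose names contain "value"
select(panda_data, contains("value"))

# columns matching a regular expression (here, names ending in a digit)
select(panda_data, matches("[0-9]$"))
```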
Questions:
1: Select the columns panda_name and age.
filter() is similar to select(), except that you are selecting specific rows that satisfy a requirement based on a column value. This function is very similar to the base function subset(). A few examples of how to use filter() are provided below:
# Select rows where pandas are greater than or equal to 5 years of age
filter(panda_data, age >= 5)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 lan_lan 4416073 7 95 f WL 80 NA
## 3 bei_bei 4416405 5 120 f WL 80 NA
## 4 bao_bao 4417779 5 110 m WL 75 60
## 5 wei_wei 4424967 7 98 f CD 60 NA
## 6 shuang_shuang 4424657 8 110 m WL 92 NA
## 7 qiuqiu 4423758 9 130 f CD 81 50
## 8 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 1 60 2017
## 2 80 2017
## 3 65 2017
## 4 76 2017
## 5 92 2017
## 6 50 2017
## 7 30 2018
## 8 70 2018
# select rows that have age>5 OR weight_kg >100
filter(panda_data, age > 5 | weight_kg > 100)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 mao_mao 4415522 4 120 f CD 70 30
## 2 lan_lan 4416073 7 95 f WL 80 NA
## 3 bei_bei 4416405 5 120 f WL 80 NA
## 4 bao_bao 4417779 5 110 m WL 75 60
## 5 wei_wei 4424967 7 98 f CD 60 NA
## 6 shuang_shuang 4424657 8 110 m WL 92 NA
## 7 qiuqiu 4423758 9 130 f CD 81 50
## 8 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 1 90 2017
## 2 80 2017
## 3 65 2017
## 4 76 2017
## 5 92 2017
## 6 50 2017
## 7 30 2018
## 8 70 2018
# select rows that have age>5 AND the base column has CD as its entry
filter(panda_data, age > 5 & base == "CD")
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 wei_wei 4424967 7 98 f CD 60 NA
## 2 qiuqiu 4423758 9 130 f CD 81 50
## genetic_value3 year
## 1 92 2017
## 2 30 2018
# Select rows where the panda age is defined by a few values. Note the use of the %in% function.
filter(panda_data, age %in% c(4,5,7))
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## genetic_value3 year
## 1 60 2017
## 2 90 2017
## 3 80 2017
## 4 65 2017
## 5 76 2017
## 6 87 2017
## 7 92 2017
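filter() also works well with helper functions and negation. The sketch below uses dplyr’s between() to keep rows within a numeric range, and ! to invert a condition; a toy stand-in for panda_data keeps the example self-contained:

```r
library(dplyr)

# Toy stand-in for panda_data
panda_data <- data.frame(
  panda_name = c("da_da", "mao_mao", "lang_lang"),
  age = c(5, 4, 10)
)

# keep pandas between 5 and 8 years of age (inclusive)
filter(panda_data, between(age, 5, 8))

# keep pandas whose age is NOT 4 or 5
filter(panda_data, !(age %in% c(4, 5)))
```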
Questions:
1: Select rows with NA in the genetic_value2
column.
2: Select rows whose panda_name is bao_bao or bei_bei.
Piping (%>%) allows the user to pass the output of one function as the input to another. Thus, instead of nesting functions (reading from the inside out), piped code reads from left to right (i.e., the way we normally read things). As a result, piped code can be more intuitive to read and can help you avoid creating and saving a lot of intermediate variables that you don’t need.
# Use a simple pipe to select the panda name and its sex, and output the result.
# Create a new object named 'pipe_result'
pipe_result <- panda_data %>%
select(panda_name, sex) %>%
head()
# Output the result to the screen
pipe_result
## panda_name sex
## 1 da_da m
## 2 mao_mao f
## 3 lan_lan f
## 4 bei_bei f
## 5 bao_bao m
## 6 tian_tian m
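To see how piping replaces nesting, the sketch below writes the same query both ways and confirms they match (a toy stand-in for panda_data is built so the snippet runs on its own):

```r
library(dplyr)

# Toy stand-in for panda_data
panda_data <- data.frame(
  panda_name = c("da_da", "mao_mao"),
  sex = c("m", "f"),
  age = c(5, 4)
)

# Nested form: read from the inside out
nested_result <- head(select(panda_data, panda_name, sex))

# Piped form: read from left to right
piped_result <- panda_data %>%
  select(panda_name, sex) %>%
  head()

# Both forms produce the same table
identical(nested_result, piped_result)
# TRUE
```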
Questions:
One of the most useful functions in the dplyr package is mutate(). Mutate allows the user to create new column(s), populating them with values that you define or with information from columns that already exist. You can also use mutate() to control which columns are retained in the new object that you create by setting the .keep argument.
# Create a new column, based on values from other columns that exist
# By default, .keep = "all"
new_col_ex1 <- panda_data %>%
mutate(genetic_value_new = genetic_value1 - genetic_value2,
.keep = "all")
# Print
new_col_ex1
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year genetic_value_new
## 1 60 2017 49
## 2 90 2017 40
## 3 80 2017 NA
## 4 65 2017 NA
## 5 76 2017 15
## 6 87 2017 25
## 7 92 2017 NA
## 8 50 2017 NA
## 9 30 2018 31
## 10 70 2018 NA
# You can create multiple columns at once. Best to put each new column on a separate line. This simply makes the code more readable.
new_col_ex2 <- panda_data %>%
mutate(genetic_dif = genetic_value1 - genetic_value2,
weight_g = weight_kg * 1000)
# Print
new_col_ex2
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year genetic_dif weight_g
## 1 60 2017 49 100000
## 2 90 2017 40 120000
## 3 80 2017 NA 95000
## 4 65 2017 NA 120000
## 5 76 2017 15 110000
## 6 87 2017 25 89000
## 7 92 2017 NA 98000
## 8 50 2017 NA 110000
## 9 30 2018 31 130000
## 10 70 2018 NA 110000
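Besides the default .keep = "all", two other .keep values are often handy: "used" retains only the columns involved in the calculation, and "none" retains only grouping columns and the new column(s). A quick sketch (with a toy stand-in for panda_data):

```r
library(dplyr)

# Toy stand-in for panda_data
panda_data <- data.frame(
  panda_name = c("da_da", "mao_mao"),
  weight_kg = c(100, 120),
  genetic_value1 = c(99, 70),
  genetic_value2 = c(50, 30)
)

# .keep = "used" keeps only the columns used in the calculation, plus the new one
panda_data %>%
  mutate(genetic_dif = genetic_value1 - genetic_value2, .keep = "used")

# .keep = "none" keeps only grouping columns (none here) and the new column
panda_data %>%
  mutate(weight_g = weight_kg * 1000, .keep = "none")
```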
Question:
1: How would you create a new column in panda_data called zero and give it a value of 0?
It is common to receive data in ways that we want to change or alter. For instance, we might want to change the order of the columns, re-order the rows based on particular column values, or change the column headings. mutate() includes options to change the order of columns that you create (see .before and .after in the help for the function).
relocate() can be used to move columns around using the .before and .after arguments, arrange() can be used to order the rows, and rename() can be used to give new names to column headings. By default, arrange() will sort the selected column in ascending order. We can use desc() to arrange the column in descending order.
# Arrange the panda_data by panda_name in descending order
# Place the sex column after the ID column
order_data_ex1 <- panda_data %>%
arrange(desc(panda_name)) %>%
relocate(sex, .after = ID)
# Print
head(order_data_ex1)
## panda_name ID sex age weight_kg base genetic_value1 genetic_value2
## 1 wei_wei 4424967 f 7 98 CD 60 NA
## 2 tian_tian 4424490 m 4 89 WL 95 70
## 3 shuang_shuang 4424657 m 8 110 WL 92 NA
## 4 qiuqiu 4423758 f 9 130 CD 81 50
## 5 mao_mao 4415522 f 4 120 CD 70 30
## 6 lang_lang 4427758 m 10 110 WL 60 NA
## genetic_value3 year
## 1 92 2017
## 2 87 2017
## 3 50 2017
## 4 30 2018
## 5 90 2017
## 6 70 2018
# Arrange the panda_data by genetic_value1 and genetic_value2 after selecting the columns that start with 'genetic'
# Rename the genetic_value1 column to gen_val1
order_data_ex2 <- panda_data %>%
select(starts_with("genetic")) %>%
arrange(genetic_value1, genetic_value2) %>%
rename(gen_val1 = genetic_value1)
# Print
head(order_data_ex2)
## gen_val1 genetic_value2 genetic_value3
## 1 60 NA 92
## 2 60 NA 70
## 3 70 30 90
## 4 75 60 76
## 5 80 NA 80
## 6 80 NA 65
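relocate() also accepts .before, and rename() can rename several columns in one call (always new_name = old_name). A brief sketch, using a toy stand-in for panda_data:

```r
library(dplyr)

# Toy stand-in for panda_data
panda_data <- data.frame(
  panda_name = c("da_da", "mao_mao"),
  ID = c(4415463, 4415522),
  age = c(5, 4),
  sex = c("m", "f")
)

# Move the sex column before the ID column
panda_data %>% relocate(sex, .before = ID)

# Rename several columns at once: new_name = old_name
panda_data %>% rename(name = panda_name, id = ID)
```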
Question:
1: Rename the weight_kg column to panda_wgt_kg.
Summary statistics are easily calculated using the summarise() function (note: summarize() will also work). The summary statistics will be calculated across all rows in the dataframe, unless specified otherwise. summarise() is often used in combination with group_by() (see below for more information). Various useful summary functions are provided in the help documentation for the function.
# Calculate the mean weight of the pandas and provide a count.
sum_table <- panda_data %>%
summarise(mean_wgt = mean(weight_kg),
n = n())
# Print summary table
sum_table
## mean_wgt n
## 1 108.2 10
# Calculate the avg weight and minimum weight. Don't create a new object.
panda_data %>%
summarise(avg_wt = mean(weight_kg),
min_wt = min(weight_kg))
## avg_wt min_wt
## 1 108.2 89
group_by() can be used to divide data rows into groups based on grouping column(s) that we specify. This function is often used in combination with other functions that define what to do with the rows after placing them in groups. When group_by() and summarise() are used together, you are essentially telling R to separate the rows into different groups and then, for each group, using summarise() to generate a series of summary statistics that characterize the column values.
# Similar to above, calculate average and minimum weight, but summarize based on 'base' column. Include a count of each group.
panda_data %>%
group_by(base) %>%
summarise(avg_wt = mean(weight_kg),
min_wt = min(weight_kg),
n = n())
## # A tibble: 2 × 4
## base avg_wt min_wt n
## <chr> <dbl> <int> <int>
## 1 CD 112 98 4
## 2 WL 106. 89 6
# Group summaries can also be calculated across multiple groups.
# Here, we calculate the same as above, but based on base and sex
panda_data %>%
group_by(base, sex) %>%
summarise(avg_wt = mean(weight_kg),
min_wt= min(weight_kg),
n = n())
## # A tibble: 4 × 5
## # Groups: base [2]
## base sex avg_wt min_wt n
## <chr> <chr> <dbl> <int> <int>
## 1 CD f 116 98 3
## 2 CD m 100 100 1
## 3 WL f 108. 95 2
## 4 WL m 105. 89 4
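One caveat when summarizing: columns such as genetic_value2 contain NA values, and mean() returns NA whenever any input is NA. Passing na.rm = TRUE drops the NAs before the calculation. A quick sketch with a toy stand-in for panda_data:

```r
library(dplyr)

# Toy stand-in for panda_data; genetic_value2 contains an NA
panda_data <- data.frame(
  base = c("CD", "CD", "WL"),
  genetic_value2 = c(50, NA, 60)
)

# Without na.rm = TRUE, the CD group's mean would come back NA;
# with it, the NA is dropped before averaging
panda_data %>%
  group_by(base) %>%
  summarise(avg_gv2 = mean(genetic_value2, na.rm = TRUE))
```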
The last of the important dplyr tools we will cover is how to join tables together. This is common practice in ecological statistics, where data in one table must be appended to another table for analyses. To join these tables, we need to link the tables based on shared columns.
As an example, we will import a table (panda_data_med.csv) located in our Data folder that summarizes the vaccination history of each of our pandas. As with other functions, we have multiple options to join the tables together. The most common join to use is a left_join(). In this case, all the records in our first table (i.e., the left table) are linked with those that match, or are shared with, the second table (i.e., the right table). This means that we keep all the records in the left table and include only those records that match in the right table (NA is returned where no match exists).
Other join options, including inner_join(), right_join(), and full_join(), also exist. Read the help file to determine which join is most appropriate to accomplish the activity of interest.
Note: Sometimes the column headings between tables do not match, even if the data within the columns do (ID in table 1 != id in table 2). In these cases, you can either rename() the column headings so that they match or specify the columns you want to join on.
# Read in the vaccination table
panda_med <- read.csv(file="Data/panda_data_med.csv")
# Join all the rows in table 1 (panda_data) with table 2 (panda_med) to determine which of the pandas were vaccinated. Since ID exists in both tables, this is a straightforward join.
# Arrange the result by ID and year vaccinated
panda_join_ex1 <- panda_data %>%
left_join(panda_med, by = "ID") %>%
arrange(ID, year_vaccination)
# Print
head(panda_join_ex1)
## panda_name.x ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 mao_mao 4415522 4 120 f CD 70 30
## 4 lan_lan 4416073 7 95 f WL 80 NA
## 5 bei_bei 4416405 5 120 f WL 80 NA
## 6 bao_bao 4417779 5 110 m WL 75 60
## genetic_value3 year panda_name.y year_vaccination vaccine_type
## 1 60 2017 da_da 2018 canine distemper
## 2 90 2017 mao_mao 2017 canine distemper
## 3 90 2017 mao_mao 2018 rabies
## 4 80 2017 <NA> NA <NA>
## 5 65 2017 <NA> NA <NA>
## 6 76 2017 <NA> NA <NA>
# Example of how to join based on multiple fields and fields that don't exactly match
panda_join_ex2 <- panda_data %>%
left_join(panda_med,
by = c("ID" = "ID",
"year"= "year_vaccination"))
# Print
head(panda_join_ex2)
## panda_name.x ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## genetic_value3 year panda_name.y vaccine_type
## 1 60 2017 <NA> <NA>
## 2 90 2017 mao_mao canine distemper
## 3 80 2017 <NA> <NA>
## 4 65 2017 <NA> <NA>
## 5 76 2017 <NA> <NA>
## 6 87 2017 <NA> <NA>
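To make the difference between join types concrete, the sketch below contrasts left_join() and inner_join() on two toy stand-in tables (not the real panda_data and panda_med files):

```r
library(dplyr)

# Toy stand-ins for panda_data and panda_med
panda_data <- data.frame(ID = c(1, 2, 3), age = c(5, 4, 7))
panda_med  <- data.frame(ID = c(2, 3),
                         vaccine_type = c("rabies", "canine distemper"))

# left_join keeps all rows of panda_data; ID 1 has no match, so it gets NA
left_join(panda_data, panda_med, by = "ID")

# inner_join keeps only the rows with a match in both tables (IDs 2 and 3)
inner_join(panda_data, panda_med, by = "ID")
```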